In these exercises you will use some of the functions from the tidyr package to make datasets tidy.

In several of our exercises (incl. this one) we will use data on global life expectancy from Gapminder and the Titanic dataset from Kaggle. In addition, for one of the exercises on tidy data, we will use an excerpt from NationMaster data on murder and intentional homicide for 2010.

First, copy the following code into a new R script and run it to load/generate the datasets we will use in these exercises. For the moment, you do not have to understand all all of the following code (but you should be able to at the end of this workshop). Please note that in order for the code to run, your working directory should be the folder with the workshop materials (use getwd to print your current working directory if you are unsure and setwd to change it if necessary).

library(tidyverse)

gap_life <- read_csv("../data/gapminder/life_expectancy_years.csv")
titanic <- read_csv("../data/titanic/titanic.csv")

crime <- tibble(country = rep(c("Germany", "Brazil", "Norway"), 2),
                  crime = c(rep("murders", 3), rep("intentional homicide rate", 3)),
                  year = 2010,
                  value = c(690, 40974, 29, 0.84, 27, 0.68))

1

Have a look at the gap_life dataset. What do you notice?
The dataset is in wide format with each year as a column/variable. To increase human-readability and to facilitate further data wrangling and exploration steps, it makes sense to transform it into long format.

2

Transform the gap_life dataset into a sensible long format.

You should gather the years into one column/variable. If you are unsure about the arguments of a function, you can always consult the help files by typing (and running) a ? directly followed by the function name (e.g., ?glimpse). NB: This only works if you have previously loaded the package that includes the function.

Be aware that there is one column in the dataset that you do not want to gather. You can specify this by adding -variablename as the second argument (or first if you use pipes) to the gather() function.
gap_life_long <- gap_life %>% 
  gather(-country, key = "year", value = "life_exp")

gap_life_long
## # A tibble: 40,953 x 3
##    country             year  life_exp
##    <chr>               <chr>    <dbl>
##  1 Afghanistan         1800      28.2
##  2 Albania             1800      35.4
##  3 Algeria             1800      28.8
##  4 Andorra             1800      NA  
##  5 Angola              1800      27  
##  6 Antigua and Barbuda 1800      33.5
##  7 Argentina           1800      33.2
##  8 Armenia             1800      34  
##  9 Australia           1800      34  
## 10 Austria             1800      34.4
## # ... with 40,943 more rows

3

Have a look at the crime dataset. What do you notice?
Observations are spread across two rows. The intepretation of the value column depends on the value of crime.

4

Tidy the crime dataset, so that there is only one observation/row for each country.
You should spread the crime variable.
crime_spread <- crime %>% 
  spread(key = "crime", value = "value")

crime_spread
## # A tibble: 3 x 4
##   country  year `intentional homicide rate` murders
##   <chr>   <dbl>                       <dbl>   <dbl>
## 1 Brazil   2010                       27      40974
## 2 Germany  2010                        0.84     690
## 3 Norway   2010                        0.68      29

5

Split the Name variable in the titanic dataset into two variables: one that contains only the last name, and one that contains the first name(s) plus title (Mr., Mrs., Dr., etc.).
When looking at the Name variable in the titanic dataset you should notice that the last name is separated from the rest of the name by a comma plus a space.
titanic <- titanic %>% 
  separate(Name, c("last_name", "first_name"), sep =", ")

titanic
## # A tibble: 891 x 13
##    PassengerId Survived Pclass last_name first_name Sex     Age SibSp Parch
##          <dbl>    <dbl>  <dbl> <chr>     <chr>      <chr> <dbl> <dbl> <dbl>
##  1           1        0      3 Braund    Mr. Owen ~ male     22     1     0
##  2           2        1      1 Cumings   Mrs. John~ fema~    38     1     0
##  3           3        1      3 Heikkinen Miss. Lai~ fema~    26     0     0
##  4           4        1      1 Futrelle  Mrs. Jacq~ fema~    35     1     0
##  5           5        0      3 Allen     Mr. Willi~ male     35     0     0
##  6           6        0      3 Moran     Mr. James  male     NA     0     0
##  7           7        0      1 McCarthy  Mr. Timot~ male     54     0     0
##  8           8        0      3 Palsson   Master. G~ male      2     3     1
##  9           9        1      3 Johnson   Mrs. Osca~ fema~    27     0     2
## 10          10        1      2 Nasser    Mrs. Nich~ fema~    14     1     0
## # ... with 881 more rows, and 4 more variables: Ticket <chr>, Fare <dbl>,
## #   Cabin <chr>, Embarked <chr>